This project looks into the NYPD Shooting Incident Data obtained from https://catalog.data.gov/dataset/nypd-shooting-incident-data-historic. The data describes shooting incidents in New York City dating back to 2006 and provides information about exact location, time, and information about the victim and suspect.Through this analysis, we aim to identify key factors that contribute to fatal shootings and to explore how variables such as borough, race, age, and gender of both the victim and perpetrator play a role in the likelihood of a shooting incident resulting in death.
Columns
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(lubridate)
library(leaflet)
library(leaflet.extras)
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
##
## The following object is masked from 'package:ggplot2':
##
## margin
library(e1071)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(caTools)
url_in <- "https://data.cityofnewyork.us/api/views/833y-fsy8/rows.csv?accessType=DOWNLOAD"
shooting_data <- read_csv(url_in)
## Rows: 28562 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (12): OCCUR_DATE, BORO, LOC_OF_OCCUR_DESC, LOC_CLASSFCTN_DESC, LOCATION...
## dbl (7): INCIDENT_KEY, PRECINCT, JURISDICTION_CODE, X_COORD_CD, Y_COORD_CD...
## lgl (1): STATISTICAL_MURDER_FLAG
## time (1): OCCUR_TIME
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
### remove columns not needed
shooting_data_cleaned <- shooting_data %>% select(-LOCATION_DESC, -LOC_OF_OCCUR_DESC, -LOC_CLASSFCTN_DESC, -Lon_Lat)
### convert the OCCUR_DATE into a date object
shooting_data_cleaned <- shooting_data_cleaned %>% mutate(OCCUR_DATE = mdy(OCCUR_DATE))
### arrange from oldest to newest by OCCUR_DATE
shooting_data_cleaned <- shooting_data_cleaned %>% arrange(OCCUR_DATE)
### convert OCCUR_TIME into time format
shooting_data_cleaned$OCCUR_TIME <- hms(shooting_data_cleaned$OCCUR_TIME)
### convert categorical variables into factors
shooting_data_cleaned <- shooting_data_cleaned %>% mutate(
BORO = factor(BORO),
PRECINCT = factor(PRECINCT),
JURISDICTION_CODE = factor(JURISDICTION_CODE),
PERP_AGE_GROUP = factor(PERP_AGE_GROUP),
PERP_SEX = factor(PERP_SEX),
PERP_RACE = factor(PERP_RACE),
VIC_AGE_GROUP = factor(VIC_AGE_GROUP),
VIC_SEX = factor(VIC_SEX),
VIC_RACE = factor(VIC_RACE))
### Get rid of NAs
shooting_data_cleaned <- shooting_data_cleaned %>% filter(
!is.na(JURISDICTION_CODE),
!is.na(PERP_SEX),
!is.na(PERP_AGE_GROUP),
!is.na(Longitude),
!is.na(Latitude))
### combine unknown and null values in column
shooting_data_cleaned <- shooting_data_cleaned %>%
mutate(PERP_AGE_GROUP = recode(PERP_AGE_GROUP,
"UNKNOWN" = "Unknown",
"(null)" = "Unknown"))
shooting_data_cleaned <- shooting_data_cleaned %>%
mutate(PERP_RACE = recode(PERP_RACE,
"UNKNOWN" = "Unknown",
"(null)" = "Unknown"))
shooting_data_cleaned <- shooting_data_cleaned %>%
mutate(PERP_SEX = recode(PERP_SEX,
"U" = "Unknown",
"(null)" = "Unknown"))
###recode "U"/"UNKNOWN" to be "Unknown" for consistancy
shooting_data_cleaned <- shooting_data_cleaned %>%
mutate(VIC_SEX = recode(VIC_SEX,
"U" = "Unknown"))
shooting_data_cleaned <- shooting_data_cleaned %>%
mutate(VIC_AGE_GROUP = recode(VIC_AGE_GROUP,
"UNKNOWN" = "Unknown"))
shooting_data_cleaned <- shooting_data_cleaned %>%
mutate(VIC_RACE = recode(VIC_RACE,
"UNKNOWN" = "Unknown"))
### locate unique values and clean up
unique(shooting_data_cleaned$PERP_AGE_GROUP)
## [1] 25-44 18-24 Unknown 45-64 <18 65+ 224 940 1020
## Levels: Unknown <18 1020 1028 18-24 224 25-44 45-64 65+ 940
shooting_data_cleaned <- shooting_data_cleaned %>%
filter(!PERP_AGE_GROUP %in% c("224", "940", "1020", "1028")) %>%
droplevels() #remove empty factor level
unique(shooting_data_cleaned$VIC_AGE_GROUP)
## [1] 25-44 <18 18-24 45-64 65+ Unknown 1022
## Levels: <18 1022 18-24 25-44 45-64 65+ Unknown
shooting_data_cleaned <- shooting_data_cleaned %>%
filter(!VIC_AGE_GROUP %in% c("1022")) %>%
droplevels()
### obtain a summary
summary(shooting_data_cleaned)
## INCIDENT_KEY OCCUR_DATE OCCUR_TIME
## Min. : 9953245 Min. :2006-01-01 Min. :0S
## 1st Qu.: 51724135 1st Qu.:2008-09-27 1st Qu.:3H 52M 0S
## Median : 86012377 Median :2012-07-30 Median :15H 25M 0S
## Mean :122468275 Mean :2013-12-27 Mean :12H 59M 41.4232757270365S
## 3rd Qu.:205280615 3rd Qu.:2019-11-17 3rd Qu.:20H 37M 0S
## Max. :279758069 Max. :2023-12-29 Max. :23H 59M 0S
##
## BORO PRECINCT JURISDICTION_CODE STATISTICAL_MURDER_FLAG
## BRONX :5849 75 : 1070 0:16200 Mode :logical
## BROOKLYN :7039 73 : 913 1: 67 FALSE:15337
## MANHATTAN :2714 44 : 746 2: 2886 TRUE :3816
## QUEENS :2896 47 : 746
## STATEN ISLAND: 655 46 : 724
## 67 : 644
## (Other):14310
## PERP_AGE_GROUP PERP_SEX PERP_RACE
## Unknown:4262 Unknown: 2580 Unknown : 2918
## <18 :1673 F : 443 AMERICAN INDIAN/ALASKAN NATIVE: 2
## 18-24 :6424 M :16130 ASIAN / PACIFIC ISLANDER : 169
## 25-44 :6032 BLACK :11878
## 45-64 : 697 BLACK HISPANIC : 1388
## 65+ : 65 WHITE : 298
## WHITE HISPANIC : 2500
## VIC_AGE_GROUP VIC_SEX VIC_RACE
## <18 :2134 F : 2060 AMERICAN INDIAN/ALASKAN NATIVE: 9
## 18-24 :6793 M :17084 ASIAN / PACIFIC ISLANDER : 343
## 25-44 :8601 Unknown: 9 BLACK :13012
## 45-64 :1405 BLACK HISPANIC : 1941
## 65+ : 161 Unknown : 52
## Unknown: 59 WHITE : 582
## WHITE HISPANIC : 3214
## X_COORD_CD Y_COORD_CD Latitude Longitude
## Min. : 920263 Min. :127539 Min. :40.52 Min. :-74.23
## 1st Qu.: 999876 1st Qu.:183445 1st Qu.:40.67 1st Qu.:-73.94
## Median :1007969 Median :197702 Median :40.71 Median :-73.91
## Mean :1009049 Mean :209630 Mean :40.74 Mean :-73.91
## 3rd Qu.:1017042 3rd Qu.:240747 3rd Qu.:40.83 3rd Qu.:-73.88
## Max. :1065474 Max. :271128 Max. :40.91 Max. :-73.71
##
### The Unknowns in the applicable columns make sense for that category
# Get Yearly data
yearly_data_for_line_plot <- shooting_data_cleaned %>%
mutate(Year = year(OCCUR_DATE)) %>%
group_by(Year) %>%
summarise(Incidents = n())
# What time of day or week do most shooting incidents occur?
times <- shooting_data_cleaned %>%
mutate(
Year = year(OCCUR_DATE),
Month = month(OCCUR_DATE),
Day = day(OCCUR_DATE),
Hour = hour(OCCUR_TIME),
Weekday = wday(OCCUR_DATE, label = TRUE)
)
# Which Boroughs have the highest shooting incidents?
shooting_count_by_boro <- shooting_data_cleaned %>%
group_by(BORO) %>%
summarise(Incidents = n()) %>%
arrange(desc(Incidents))
incidents_by_hour <- times %>%
group_by(Hour) %>%
summarise(Incidents = n()) %>%
ggplot(aes(x = Hour, y = Incidents)) +
geom_line(color = "pink", linewidth = 1) +
labs(
title = "Shooting Incidents by Hour of the Day",
x = "Hour of the Day",
y = "Number of Incidents"
) +
theme_minimal()
ggplotly(incidents_by_hour)
ggplot(yearly_data_for_line_plot, aes(x = Year, y = Incidents)) +
geom_line(color = "blue", size = 1) +
geom_point(color = "red", size = 2) +
labs(title = "Yearly Shooting Incidents", x = "Year", y = "Number of Incidents") +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Line Chart 1 shows that most shooting incidents occurred in the cover of night either the early hours of the morning or later times in the day. As the sun sets, the number of shooting incidents seem to increase.
Line Chart 2 shows a decline in shooting incidents over years of 2006 to 2019 and then a sudden rise starting from 2020 to 2022 before starting to decrease again. There are some notable events that may have contributed to the increase in the number of shootings from 2019 to 2022 in New York such as the George Floyd protests as well as changes to laws that aimed to decrease the population of inmates in jail leading to repeat offenders being in the population. The sudden decrease in shooting incidents in 2023 could be the result of New York’s efforts to decrease the number of firearms in high-crime neighborhoods.
shooting_by_boro_bar <- shooting_count_by_boro %>%
ggplot(aes(x = reorder(BORO, Incidents), y = Incidents, fill = BORO)) +
geom_bar(stat = "identity", show.legend = FALSE) +
labs(
title = "Number of Shooting Incidents by Borough",
x = "Borough",
y = "Number of Shooting Incidents"
) +
theme_dark() +
coord_flip()
shooting_by_boro_bar
leaflet(shooting_data_cleaned) %>%
addTiles() %>%
addHeatmap(~Longitude, ~Latitude, blur = 5, max = 1, radius = 11) %>%
setView(lng = mean(shooting_data_cleaned$Longitude, na.rm = TRUE),
lat = mean(shooting_data_cleaned$Latitude, na.rm = TRUE),
zoom = 11)
There are shooting incidents all over New York, however the heatmap shows that the highest concentration of shootings occur in certain boroughs like the Bronx and Brooklyn neighborhoods. These boroughs were identified using the bar chart which displays the highest shooting incidents by borough. However, a heatmap offered a more visually pleasing view of exactly which areas of New York had the highest shooting incidents as viewers are able to see it spatially displayed on a map.
#RandomForest to predict death vs. non-death
set.seed(555)
#split data into training and test data set
split <- sample.split(shooting_data_cleaned$STATISTICAL_MURDER_FLAG, SplitRatio = 0.8)
data_train <- subset(shooting_data_cleaned, split == TRUE)
data_test <- subset(shooting_data_cleaned, split == FALSE)
data_train$STATISTICAL_MURDER_FLAG <- as.factor(data_train$STATISTICAL_MURDER_FLAG) #ensure STATISTICAL_MURDER_FLAG is a factor
data_test$STATISTICAL_MURDER_FLAG <- as.factor(data_test$STATISTICAL_MURDER_FLAG)
#train model with fixed seed so easy to reproduce
rf_model <- randomForest(
STATISTICAL_MURDER_FLAG ~ BORO + PERP_RACE + PERP_AGE_GROUP + VIC_RACE + VIC_AGE_GROUP +
PERP_SEX + VIC_SEX,
data = data_train,
importance = TRUE,
ntree = 1500, # increase number of trees to
nodesize = 10,
classwt = c(0.54, .46), # adjust weight because there are more non deaths than deaths to increase the accuracy of the model
keep.forest = TRUE,
do.trace = 100 #show every 100 OOB value
)
## ntree OOB 1 2
## 100: 41.75% 42.22% 39.83%
## 200: 42.07% 42.70% 39.53%
## 300: 42.41% 43.41% 38.36%
## 400: 42.29% 43.43% 37.70%
## 500: 42.37% 43.51% 37.80%
## 600: 42.30% 43.38% 37.93%
## 700: 42.19% 43.29% 37.77%
## 800: 42.00% 43.04% 37.80%
## 900: 42.05% 43.17% 37.54%
## 1000: 42.17% 43.22% 37.96%
## 1100: 42.07% 43.19% 37.57%
## 1200: 42.11% 43.15% 37.96%
## 1300: 42.07% 43.11% 37.93%
## 1400: 42.11% 43.21% 37.70%
## 1500: 42.15% 43.25% 37.73%
print(rf_model)
##
## Call:
## randomForest(formula = STATISTICAL_MURDER_FLAG ~ BORO + PERP_RACE + PERP_AGE_GROUP + VIC_RACE + VIC_AGE_GROUP + PERP_SEX + VIC_SEX, data = data_train, importance = TRUE, ntree = 1500, nodesize = 10, classwt = c(0.54, 0.46), keep.forest = TRUE, do.trace = 100)
## Type of random forest: classification
## Number of trees: 1500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 42.15%
## Confusion matrix:
## FALSE TRUE class.error
## FALSE 6963 5307 0.4325183
## TRUE 1152 1901 0.3773338
plot(rf_model)
#make predictions on the test data
rf_predictions <- predict(rf_model, newdata = data_test)
#plot prediction bar chart
# Combine predictions with actual values into a data frame
comparison_df <- data.frame(
Actual = data_test$STATISTICAL_MURDER_FLAG,
Predicted = rf_predictions
)
# Create a bar plot of predicted vs actual values
ggplot(comparison_df, aes(x = Actual, fill = Predicted)) +
geom_bar(position = "fill") +
labs(title = "Predicted vs Actual Proportions", y = "Proportion") +
scale_fill_manual(values = c("pink", "purple")) +
theme_minimal()
#get confusion matrix
conf_matrix <- table(Predicted = rf_predictions, Actual = data_test$STATISTICAL_MURDER_FLAG)
print(conf_matrix)
## Actual
## Predicted FALSE TRUE
## FALSE 1805 287
## TRUE 1262 476
#plot the factor importance
importance(rf_model)
## FALSE TRUE MeanDecreaseAccuracy MeanDecreaseGini
## BORO 4.566203 8.332650 8.804379 123.42041
## PERP_RACE -96.894825 112.149964 -76.355233 197.18183
## PERP_AGE_GROUP -87.898502 201.773839 -17.319437 506.95139
## VIC_RACE 1.315969 18.425863 9.791114 116.59023
## VIC_AGE_GROUP 10.374924 21.889063 21.009754 143.35430
## PERP_SEX -49.833382 47.385749 -48.613113 89.13826
## VIC_SEX -9.693797 4.963822 -6.526358 43.30935
varImpPlot(rf_model)
A random forest model was used because the STATISTICAL_MURDER_FLAG results in False or True values that were turned into factors and decision trees in the random forest model are good at working with categorical variables in a large dataset. From the model summary, we can see that the error for data that was not part of the training data was 42.15% (OOB). This is quite high so the model is not very good at predicting if a shooting incident resulted in a death as supported by the confusion matrix. The bar chart shows shows the predicted vs the actual proportions of the deaths but the model could be improved. Although efforts were made to improve the model, because of the high amount of non-deaths in the data set, the model became really good at predicting non-deaths but resulted in extremely high errors for predicting deaths.
Based on feature importance, we can see that Boro is the most important feature to predicting whether a shooting incident results in a death. Through previous visualizations, we saw how the Bronx and Brooklyn neighborhoods had high occurrances of shooting incidents. Perhaps with more research we are able to see if being involved in a shooting incident in one of these neighborhoods is more predictive of a death.
The other important feature is the Victim’s Age Group. It is possible that the younger that a person is, the higher chance that a shooting incident could result in death. This could be confirmed with further analysis.
This analysis of the NYPD Shooting Incident Data shows that although there was a decline in shooting incidents since 2006 and then a rise before declining again, there are borughs where shooting incidents are more concentrated than in other boroughs. These findings suggest that there could be more strategies to mitigate the number of shooting incidents by focusing on the high concentration areas. The findings also suggest that by creating an improved model for factors contributing to the STATISTICAL_MURDER_FLAG could lead to more support, funding, and education in certain communities to prevent deaths.
A personal bias that may have influenced my analysis is that I have viewed all of New York has a high crime area and therefore was not surprised when the map showed much of New York covered in shooting incidents. I mitigated for this by creating a heatmap that identifies the most concentrated areas of shooting incidents instead of just having the map of New York with all the shooting incidents marked on it.